perm filename KWIC.LES[UP,DOC] blob sn#084242 filedate 1977-03-09 generic text, type C, neo UTF8
KWIC  --  a keyword in context program  --  L. Earnest, December 1973

This program can be used to produce a concordance, index, word count,
or word  list for any given text file.   The simplest command that it
understands is:

*<source file name>

This  causes the  source file  to  be scanned  for  words, which  are
compared with  an internal dictionary of common  words.  Any that are
not in the dictionary are  considered to be "keywords".  The  program
produces  an output  file,  in this  case  called <source>.KWC,  that
contains  an alphabetized  list of  keywords, one per  line, together
with the local context and a reference to the page  and line on which
they  occur.  It  also  lists  the  number  of  occurrences  of  each
dictionary word.  A typical output might begin as follows.

                       Concordance of SIGNUP[W,LES]                        
                    275 keywords, 961 dictionary words                     

                                47 a
                                 5 about
Page Line                          ------
  5   22                 A roll of adhesive tape or electrical tape.  
                                 6 after
Page Line                          ------
  1   30  August 16 at noon in the AI Conference Room.  
                                 2 air(s)(ed)(ing)
                                 3 all
Page Line                          ------
  3   15         If you come to an ambiguous fork in the trail, preferably 
                                 1 among
  ......

Numbers appearing  just to  the left  of center  are word counts  for
dictionary  words (with  various suffixes), while  the page  and line
numbers point to the locations of keywords in the  original document.
Line numbers are counted from the top  of the page.  SOS line numbers
(if  any) are ignored, as are TV/E  directory pages,  though the page
numbering includes  the directory.   Words  beginning with  different
letters of the alphabet are placed on different output pages.
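
The bookkeeping described above can be sketched in Python.  This is an
illustration only, not the original implementation: the function name
locate_keywords is hypothetical, pages are assumed to be separated by
form feeds, words are folded to lower case (an assumption), and the
handling of suffixes, SOS line numbers, and directory pages is left
out.

```python
import re

# A word starts with a letter; see the scanning rules for the full definition.
WORD = re.compile(r"(?<![A-Za-z0-9])[A-Za-z][A-Za-z0-9'/-]*")

def locate_keywords(text, dictionary):
    """Return (sorted keyword locations, dictionary word counts).

    Pages are assumed to be separated by form feeds; line numbers are
    counted from the top of each page, starting at 1.
    """
    locations = []   # (keyword, page, line) triples
    counts = {}      # dictionary word -> number of occurrences
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            for match in WORD.finditer(line):
                word = match.group().lower()   # case folding is an assumption
                if word in dictionary:
                    counts[word] = counts.get(word, 0) + 1
                else:
                    locations.append((word, page_no, line_no))
    locations.sort()
    return locations, counts
```

Called with a two-page text and a small dictionary, this yields an
alphabetized list of keywords with their page and line numbers, plus a
count for each dictionary word.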

			General Command Format

The more general command format is:

*[<output file>←]<source file>[/ONLY | /ALL][/INDEX | /COUNT | /LIST]

where bracketed  elements are optional  and alternative  switches are
separated by "|".  Both source and output files must be on the disk.

All  switches may  be abbreviated  to one letter.   The  /ONLY switch
causes only keywords to be  listed in the output file (i.e.  omitting
counts of dictionary  words).  The /ALL switch  causes the dictionary
to  be ignored,  so  ALL words  are treated  as keywords.  (Beware: a
concordance produced with the /ALL switch on is typically about 10
times the size of the original document.)

The /INDEX switch causes the context to be omitted and produces a
three-column listing of words, giving the original location (page and
line) of each keyword and the number of occurrences of each dictionary
word.  The /COUNT switch causes only word counts to be generated for
keywords and produces a four-column listing of these counts.  The
/LIST switch produces a raw, seething word list (i.e., an alphabetized
list of all words used), one per line, with no header information, and
all on one long page.
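
As an illustration of the command format, the following Python sketch
splits a command line into its parts.  It is not the original parser:
the names parse_command and GROUPS are hypothetical, switches are
accepted in any order and under any unambiguous prefix (both
assumptions; the text above only promises one-letter abbreviations),
and an omitted output file is reported as None rather than guessing
the default name.

```python
# Alternative switches, grouped as in the command format.
GROUPS = {
    "scope": ("ONLY", "ALL"),
    "format": ("INDEX", "COUNT", "LIST"),
}

def parse_command(command):
    """Split '[output←]source[/switch...]' into its parts."""
    # Drop the leading '*' prompt character if it was copied along.
    head, *switches = command.strip().lstrip("*").split("/")
    output, _, source = head.rpartition("←")
    if not source.strip():
        raise ValueError("missing source file name")
    chosen = {"scope": None, "format": None}
    for raw in switches:
        name = raw.strip().upper()
        for group, names in GROUPS.items():
            match = [n for n in names if name and n.startswith(name)]
            if match:
                if chosen[group] is not None:
                    raise ValueError(f"conflicting switch: /{raw}")
                chosen[group] = match[0]
                break
        else:
            raise ValueError(f"unknown switch: /{raw}")
    return {
        "output": output.strip() or None,   # None means "use the default"
        "source": source.strip(),
        "scope": chosen["scope"],
        "format": chosen["format"],
    }
```

For example, parse_command("OUT←SIGNUP[W,LES]/A/I") selects the ALL
and INDEX switches, while two switches from the same group (such as
/O/A) are rejected.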

			 Scanning Procedure

KWIC treats as a word any alphanumeric string beginning with a letter
and  possibly containing "'", "-",  or "/", but nothing  else.  Thus,
strings beginning with a digit are ignored.  Words hyphenated over line
boundaries are reassembled.
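
The word rule can be approximated with a pair of regular expressions.
The Python sketch below is an assumption-laden illustration, not the
original scanner: the names WORD, LINE_BREAK_HYPHEN, and scan_words
are hypothetical, and edge cases that the description does not settle
(such as a trailing apostrophe, or a genuine compound that happens to
break at its hyphen) are handled arbitrarily.

```python
import re

# A letter, then letters, digits, "'", "-", or "/".  The lookbehind keeps
# the tail of a digit-led string such as "42nd" from being taken as a word.
WORD = re.compile(r"(?<![A-Za-z0-9])[A-Za-z][A-Za-z0-9'/-]*")

# A hyphen at the end of a line, directly after a letter or digit and
# followed on the next line by more of the word.
LINE_BREAK_HYPHEN = re.compile(
    r"(?<=[A-Za-z0-9])-[ \t]*\r?\n[ \t]*(?=[A-Za-z0-9])"
)

def scan_words(text):
    """Return the words of text, rejoining words split over line ends."""
    joined = LINE_BREAK_HYPHEN.sub("", text)
    return WORD.findall(joined)
```

Given the text "The con-", a line break, and "cordance isn't 42nd",
this returns "The", "concordance", and "isn't", and skips "42nd".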

In order to provide as much context as possible for each keyword, the
text is "dejustified" within each paragraph, so that redundant spaces
between words are removed and successive lines are concatenated, with
a <space> replacing  the <CRLF>. A new paragraph  is assumed to begin
whenever there is  a blank  line, a  <TAB> in  column 1,  or a  <form
feed>.
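
The dejustification step might look roughly like the following Python
sketch.  The function name dejustify is hypothetical, and this version
assumes that a <TAB> in column 1 starts a new paragraph with that same
line; it does not rejoin hyphenated words.

```python
def dejustify(text):
    """Return a list of paragraphs, each as one single-spaced line."""
    paragraphs = []
    current = []   # cleaned-up lines of the paragraph being collected

    def flush():
        # Close the current paragraph, if any, joining lines with a space.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for page in text.split("\f"):        # a form feed ends a paragraph
        for line in page.splitlines():
            if not line.strip():         # a blank line ends a paragraph
                flush()
                continue
            if line.startswith("\t"):    # a TAB in column 1 starts a new one
                flush()
            # Collapse runs of spaces and tabs into single spaces.
            current.append(" ".join(line.split()))
        flush()
    return paragraphs
```

Each returned string is one paragraph with single spaces between
words, which gives every keyword the widest possible context.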